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ABSTRACT 

We describe a maximum-likelihood method for determining the mass distri- 
bution in spherical stellar systems from the radial velocities of a population of 
discrete test particles. The method assumes a parametric form for the mass 
distribution and a non-parametric two-integral distribution function. We apply 
the method to a sample of 161 globular clusters in M87. We find that the mass 
within 32 kpc is (2.4 ± 0.6) x 10 12 M Q , and the exponent of the density profile 
p oc r~ a in the range 10 — 100 kpc is a — 1.6 ± 0.4. The energy distribution 
suggests a few kinematically distinct groups of globular clusters. The anisotropy 
of the globular-cluster velocity distribution cannot be determined reliably with 
the present data. Models fitted to an NFW potential yield similar mass estimates 
but cannot constrain the concentration radius r c in the range 10 — 500 kpc. 

Subject headings: Methods: statistical - galaxies: kinematics and dynamics - 
galaxies: fundamental parameters - galaxies: individual (M87) - galaxies: star 
clusters - cosmology: dark matter 

1. Introduction 

The mass distribution in elliptical galaxies is difficult to measure, in part because el- 
lipticals do not contain gas or star disks like those found in spiral galaxies. The mass 
distribution is particularly uncertain beyond the effective radius, where the surface bright- 
ness fades rapidly. Fortunately, globular clusters (GCs) are found at large distances in many 
elliptical galaxies. For example, more than 8 x 10 3 GCs are believed to be present in M87 
outside 3 times the effective radius R e ~ 7 kpc (McLaughlin 1999a; Zeilinger, M0ller & 
Stiavelli 1993). GCs in nearby galaxies provide relatively bright pointlike sources for which 
radial velocities can be determined. By analyzing the statistics of the positions and radial 



- 2 - 



velocities of GCs, we may constrain the mass profiles of their host galaxies and explore the 
dark matter distribution in the outer parts of elliptical galaxies. Planetary nebulae also 
provide excellent kinematic probes for similar reasons (e.g. Douglas et al. 2002). 

The database of extragalactic GCs with radial velocities has been steadily growing in 
recent years. The most intensively observed systems are NGC 1399 (Dirsch et al. 2004) and 
M87. In this paper we examine the latter system. Harris (1986) and McLaughlin (1999a) 
observed the density profile of GCs in M87 out to 110 kpc. Mould et al. (1990) compiled a 
list of 43 GCs with radial velocities, from which Merritt & Tremblay (1993) derived a mass 
of 5 - 10 x 10 12 M within 50 kpc. They also estimated that about 200 and 1000 GCs would 
be required to derive accurately the slope of the mass profile of the galaxy and the velocity 
anisotropy of the GCs, respectively. Cohen and Ryzhov (1997) and Cohen (2000) obtained 
radial velocities for 221 GCs. This dataset was used to study the spin (Cohen and Ryzhov 
1997; Cohen, Blakeslee & Ryzhov 1998; Kissler-Patig & Gebhardt 1998) and mass profile of 
M87 (Romanowsky & Kochanek 2001). 

Although there is significant rotation in the M87 GC system, the fractional kinetic 
energy in rotation (VLR/a) 2 ~ 20% is relatively small, so it should be safe to ignore rotation 
in our analysis and adopt the assumption that the system is spherical (Cote et al. 2001). Thus 
the data pairs [R, v z ] contain all available dynamical information, where R is the projected 
distance from the GC to the center of the galaxy, and v z is its velocity in the line of sight. 
To convert angular distance to physical distance, we adopt a distance to M87 of 16.3 Mpc 
(i.e. 79 pc arcsec -1 ) following Cohen (2000). 

Maximum likelihood analysis provides a powerful tool to analyze discrete data. It was 
used to derive the mass distribution of spherical galaxies first by Merritt (1993) and Merritt & 
Saha (1993). We describe our method for deriving spherical mass distributions from discrete 
data in Section 2. We use simulated data to test our method in Section 3. In Section 4 
we constrain the mass profile of M87 and the distribution function (hereafter DF) of its 
GCs. The mass distribution of M87 has been determined from GC kinematics by several 
authors, including Merritt & Tremblay (1993), Cohen and Ryzhov (1997) and Romanowsky 
& Kochanek (2001). The study most similar to ours is that of Romanowsky & Kochanek 
(2001). We compare our method and results to theirs in Section 5. Section 6 contains 
conclusions. 
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2. Methods 

We wish to constrain the potential $(r) of a spherical system from a set of discrete data 
pairs [Ri,v Z i\. In a spherical system, the DF depends on at most two integrals of motion, 
energy E and scalar angular momentum L, and thus takes the form f(E, L). We normalize 
so that 

J f(E,L)d 3 xd 3 v = 1. (1) 

We introduce a Cartesian coordinate system (x, y, z) with origin at the galactic center, and 
z- axis along the line of sight. In this system, the projected radius is 



R = + y2, ( 2 ) 

The probability that a given GC is found in the interval dRdv z is 2nRg(R,v z )dRdv z , where 

g(R,v z ) = J f(E,L)dzdv x dv y . (3) 

Because of spherical symmetry, we may assume without losing any information that all GCs 
lie in the x — z plane and x > 0. Thus we have 

E = $( r ) + ^(v 2 x + v 2 y + v 2 z ), 

L = ^/(v z R-v x zy + vl(R2 + z2), (4) 

Given a potential $(r) and a DF f(E,L), the likelihood of observing the data pairs 
[Ri,v zi \ is 

LH($,f) = LH{\R t1 v z ^,f) 

= Y[g(R i ,v lsi \$J)-2nR i 

i 

oc Y[g(Ri,v zi \<f>,f). (5) 

i 

The maximum likelihood method assumes that the best estimate of the potential and DF is 
the one that maximizes LH ($,/). More generally, Bayes' theorem states that if we know 
some prior information H which gives a prior probability P(<&, f\H), we should maximize 
the posterior probability P($, f\H)LH(<&, f) rather than the likelihood. 

This formalism can easily be generalized to account for selection effects and observational 
errors in radial velocities. For example, in most surveys we observe GCs only in a limited 
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range of radii (R s0 < R < R s i). To account for this limitation, we simply modify the 
normalization to require that 



/•Rsi r 

I g(R,v z )2nRdR / dv z = 1. (6) 

JRsO J 



To account for the errors in radial velocities, we just need to calculate LH with g(R,v z 
replaced by 



9'(R,v z ) = — L- [ g(R,v z )e-( v - v *) 2 / 2 °v 



dv. (7) 



Here we have assumed that the error a v is the same for all GCs, although the method is easily 
generalized to individual measurement errors. Note that our method also probes, though 
with less power, the potential beyond the maximum survey radius R s i, both because we 
observe the projected distribution, which includes GCs with radial distances to the galactic 
center larger than R s ±, and because some GC orbits have apocenters outside R s \. 

In principle, all physically possible potentials and DFs should be explored to maximize 
LH(<&,f). The most general way to do so is to assume non-parametric forms for both 
potentials and DFs, but this is cumbersome and computationally expensive. Instead we 
optimize within a family of parametrized potentials: we assume an analytical form $(r, X) 
and infer the parameters X (which can be viewed as a row vector). For example, we may 
assume a power-law density-potential pair 



P(r) = Po 



•M = t2 ilG "f («<S), (8) 

(2 - a){3 - a) \r J 

where r is an arbitrary radius, and then X = {p ,a}. We also investigate the Navarro, 
Frenk & White (1997) density-potential pair, 

p( r ) = — — r~2 > 

r I r \ 

r c V r c ) 

In (l + - 

$(r) = -47rGp r 2 c V r Tc/ . (9) 
Here r c is the concentration radius and X = {p , r c }. 



r 
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If the DF f(E,L) were known, the problem of optimizing the parameters X would be 
easy to solve. We would calculate LH as a function of X and figure out where LH peaks. 
However, in practice the DF is an unknown function, and we have to derive the DF and X 
simultaneously. This can be done by parametric or non-parametric methods. 

In the parametric method (Merritt & Saha 1993), we choose a limited number of basis 
functions fk(E, L) (i.e. a complete set of basis functions truncated to some order N), which 
are smooth in phase space. Thus, the DF is approximated as 

f(E,L) = J2^kf k (E,L), k = l,...,N. (10) 

k 

This method ensures that the derived DF is smooth; however, it is difficult to ensure that 
f(E, L) > for all (E, L) since neither w k nor f k is necessarily non-negative. Furthermore, 
if N is small we cannot ensure that equation (10) gives a good approximation to the actual 
DF. Of course, we can always increase N to improve the approximation, but then we cannot 
ensure that the best-fit DF will be smooth. 

In the non-parametric method (Merritt 1993), we divide the E — L space into N E x N L 
bins, which are denoted by the double index mn, m = 1, N E , n = 1, N L . Notice that 
the DF is isotropic if Nl = 1 and otherwise may be anisotropic. Then we construct a set of 
top-hat functions, 

h ™( E > L ) = { o otherwise, (U) 

where V mn = f^ n mn rf 3 x<i 3 v is the phase-space volume within the survey limits that is 
associated with bin mn. Thus, we have 



/ 



h mn (E,L)d 3 xd 3 v = 1, (12) 
f(E,L) = J2 w mnh mn (E,L), (13) 

m,n 

p f W mn /V mn if Vmn 7^ (14) 

mn \ otherwise, 

^2w mn = 1, (15) 

m,n 

W = {w mn }. (16) 

Here w mn is simply the fraction of GCs in bin mn, f mn is the phase space density of GCs in 
bin mn, and f(E,L) and h mn (E,L) are normalized so that their integrals over phase space 
within the survey limits are unity. 
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In our calculation, the bin ran is defined by 

E m -i < E < E m , 

71 — 1 71 

L c (E m ) <L< —L c (E m ), (17) 

where E m _i and E rn are the lower and upper energy limits of the bin, and L C (E) is the 
maximum angular momentum at energy E, corresponding to a circular orbit. These bins 
cover all of the allowed region in E — L space and may also cover some unallowed regions. 

Each h mn (E,L) gives a normalized distribution in (R,v z ) space, 

g mn (R,v z ) = J h mn (E,L)dv x dv y dz. (18) 

Notice that g mn (R,v z ) depends on $ (thus X) but is independent of the DF specified by 
W. An algorithm for evaluating g mn (R,v z ) efficiently is shown in Appendix A. To account 
for errors in radial velocities, the distribution should be convolved with a one-dimensional 
Gaussian function as in equation (7). We use the convolved distribution hereafter and drop 
the prime symbol for simplicity. 

Due to the linearity of equations (3) and (13), we have 

g(R,v z ) = ^2w mn g mn (R,v z ), 

m,n 

LH(X,W) oc l[g(Ri,v zi ) 

i 

= TT ^ w mn g mn (Ri, v zi ). (19) 

i m,n 

Therefore, for a given set of observations [Ri,v Z i], the likelihood is a function of X and W 
only. For a fixed potential specified by the parameters X, we can calculate g mn {Ri,v Z i), 
then maximize Lif(X, W) with respect to W. We may then vary X to search for a global 
maximum LH(X, W). 

An advantage of this method is that we need only ensure that all w mn are non-negative 
to guarantee that f(E, L) > 0. However, this method does not generally give smooth DFs. 
The derived DFs look like the sum of a set of 5-functions (Merritt 1993). This is a drawback, 
since to some extent the irregularities in the DFs are fitting the statistical fluctuations in the 
data rather than the signal. In other words it is an ill-posed problem to infer g(Ri, v zi ) from 
a small set of data points. One way to approach this problem is to maximize a "penalized 
likelihood" instead, which can be defined using entropy maximization (Richstone & Tremaine 
1988) or regularization (Merritt 1993; Rix et al. 1997). 
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By adding an entropy term, we maximize the function 



T = InLH + aS, 

S = 



J C(/)d 3 xd 3 v, 



(20) 
(21) 



where a is a parameter chosen to control the degree of smoothness of the DF, and C(f) 
is a convex function of / such as / In /. Adding the entropy term tends to force the DF 
to be uniform in phase space (in the limit of large a, the data become irrelevant and the 
maximum value of T occurs for / = const). This tendency is particularly undesirable in our 
case because there is a large phase volume at high energy, so that large a tends to place 
more GCs in a high energy state. We have done simulations to test this method and found 
that the derived DF is either too irregular for small a or has an excess high energy tail for 
large a. 



To implement regularization, we maximize the function 

Q(X, W) = \nLH- \ E U E - \ L U L , 



(22) 



where \ E and \l are positive parameters that adjust the degree of smoothness of the DF 
and and H L are dimensionless positive-definite functions of the DF. There are several 
natural forms for U E , including 



He = 

n £ = 
n £ = 



( d\nf \ 2 \ 
\d{E/E c )) / 

d 2 In/ 



d{E/E c f 
d 2 In/ 



d\n(E/E c y 



uniform form, 
exponential form, 
power-law form, 



(23) 
(24) 
(25) 



where E c is a normalizing constant. The names indicate what functions minimize II^; for 
example, the uniform form is zero if / = const, the exponential form is zero if / oc exp(— f3E), 
where (3 is a constant, etc. The angle brackets denote an average over area in E — L/L C (E) 
space; for example, if we use a fixed energy interval for all bins, the exponential form is 
proportional to 



^2 I ln f(m+i)n - 2 In f mn + In / ( 



m— l)n | j 



(26) 



in case no V mn is zero. The term 11^ smoothing the DF along the L direction can be written 
similarly. 
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Before we choose the form of U E and Il L , it is helpful to discuss the meaning of the 
regularization term. Equation (22) is equivalent to 

e Q = e- XEliE - XLUL LH(^J). (27) 

Therefore, maximizing Q is equivalent to maximizing the posterior probability if the prior 
probability for / is 

P(f\H f ) = e -^u E -\ L ii L ( 2g ) 

By choosing and n^, we are assuming some prior information Hf, i.e., a specific degree 
of smoothness and shape of the DF. Therefore, the smoothing terms assess the physical 
plausibility of the DF (Cox & O'Sullivan 1990). 

Generally, DFs of GCs are strongly non-uniform in the sense that they may change 
amplitude by a few orders of magnitude over the observable range. For example, given a 
logarithmic potential and a GC density profile 

$ = 2cr 2 mr, z/(r)ocr" 3 , (29) 

the corresponding isotropic DF is 



f(E) oce- 3E / 2a2 . (30) 

Over the range R < R < R±, the DF at zero velocity will vary by a factor of (Ri/R ) 3 , 
which is about 3000, given that the ratio R\/R$ is 110 kpc/7 kpc ~ 15 in the case of M87 
(see §4). 

Given this large variation, neither entropy maximization as in equation (21) nor the 
uniform form of the regularization term in equation (23) is appropriate, because they tend 
to favor a uniform DF over the entire phase space. The exponential and power-law forms in 
equations (24) and (25) give more freedom and favor exponential or power-law forms for the 
DF, which are more physically reasonable. We adopt the exponential form throughout this 
paper. 

For convenience, we define 

Q maa; (X)=maxQ(W,X), (31) 
w 

which is Q maximized with respect to the weights W. The function Q(W, X) usually has a 
number of local maxima in W at a fixed set of potential parameters X. We attempt to find 
the global maximum in this landscape using simulated annealing and the downhill simplex 
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method, as discussed in Press et al. (2003). The method does not guarantee that we have 
found the global maximum, so we try several different initial values of W and choose the 
largest maximum found from any of these initial conditions. The optimization process turns 
out to consume most of the computational resources required by this method. 

We may also assume some prior information H x for the potential parameters, which 
gives a prior probability P(K\H X ). For example, in the power-law potential (eq. 8), it is 
reasonable to assume a uniform prior in (lnp , a), thus we have 

P(X\H X ) = Po \ (32) 

Then the probability distribution of the potential parameters X is 

P(X) oc P(~X\H x )e Qma *W. (33) 



We may also incorporate the observed surface number density profile Yt b s (Rj) of the 
GCs and its error a-s(Rj) as an additional constraint in our calculation since this may be 
available for many more GCs, over a larger survey area than have measured velocities. We 
maximize the quantity 

Q' = \nLH-^-X E U E -X L U L: (34) 



X 



. ■ (35) 

H(R) oc J g(R,v z )dv z 

= ^2,w mn J g mn (R,v z )dv z , (36) 

m,n 

where both E(i?) and S of)S (i?) should be normalized so that 2n f RdRT,(R) and 2n f RdRlu obs (R) 
are unity within the survey limit of GC number counting. 

How should we choose A^ and Al? If these parameters are set too small, they do not 
smooth the DF at all. If they are set too large, the DF is forced to a specific functional 
form that may not be demanded by the data. Notice that in contrast to many problems 
such as least-squares fitting (Press et al. 2003) the likelihood is only known to within a 
multiplicative constant, so the goodness of fit can only be assessed in a relative sense. When 
maximizing Q' (eq. 34), we choose smoothing parameters following Rix et al. (1997). Assume 
that the minimum x 2 i n equation (35) is Xo i n case °f no regularization. Then we increase 
the smoothing parameters so that the minimum x 2 is Xo + n s> where n s corresponds to 
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n s — a error. We call these "appropriate" smoothing parameters. A degeneracy arises in 
choosing two smoothing parameters from one requirement. The two parameters Xe and 
\l control smoothness in the direction of E and L, respectively and are not necessarily 
equal. Fortunately, we find that all "appropriate" pairs of smoothing parameters give similar 
results for derived potentials and energy distributions, although, not surprisingly, the derived 
anisotropy can be very different depending on the value of A^. Of course, with more data, 
the values of smoothing parameters will be subject to tighter constraints. Therefore, we may 
recover the anisotropy if enough data is available. 



in a power-law potential (eq. 8) with tq = 19 kpc and (po,a) = (1.9 x 10 7 M Q kpc -3 , 1.9). 
The energy E is in units of (100 km s -1 ) 2 . The anisotropy parameter 7 is a constant which 
we choose in the range from —5 (extremely radial) to 5 (extremely tangential). The DF is 
chosen so that the energy distribution is bimodal, to challenge the algorithm. We generate 
160 kinematic data points within the projected radius range 7-32 kpc and a surface number 
density profile from 7000 GCs within the projected radius range 7-110 kpc (these match the 
properties of the sample of M87 GCs described in §4.1). 

In this case, X = {p ,a}. Figure 1 shows the contour plot for Q ma x(X-) for three 
simulations with 7 = —5, and 5. The appropriate smoothing parameters are (Xe,^l) = 
(0.015, 0.015) so that n s — 1. The plus signs are the best-fit model which gives the maximum 
Qmax(X-) while the asterisk signs mark the input models. The contours are n — a levels from 
the peak, i.e., e - " 2 / 2 of the maximum likelihood. We run 20 simulations with 7 between —5 
and 5 and find that the standard errors are 10% for p an d 0.2 for a. These tests show that 
we can recover the potential parameters from this dataset, even for extremely anisotropic 



Figure 2 shows the input and derived energy distributions of the GCs, described by 



3. 



Simulations 



We first test our algorithm using simulated data. We choose a DF 




(37) 



DFs. 




(38) 



i.e., the sum of the weights over all angular-momentum bins at a given energy. The derived 
energy distributions (dotted lines) fit the input distributions (solid lines) reasonably well for 
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the nearly radial (7 = 5) and isotropic (7 = 0) DFs. However, the derived DF is shifted 
to low energies for the nearly radial DF (7 = —5), probably because the relatively large 
errors (see Figure 1) in the derived potential parameters lead to a shift of zero-energy point. 
Nevertheless, the bimodality of the DF is successfully recovered. 

To determine whether our method can recover the anisotropy of the DF from such a 
small data set, we define an indicator 

where f(E) is the DF averaged over phase space at energy E, 

f° +SE f(E,L)d^v 



f(E) 



j* +&E d^dH 

Xm w mn 



(40) 



Figure 3 shows the weighted indicator of anisotropy 

J(L/L C ) = J n = J2l mn U m , (41) 

m 

which is a constant for an isotropic DF. The solid lines show the input and the dotted lines 
show the derived anisotropy indicators associated with the best-fit models in Figure 1. For 
the isotropic (7 = 0, dotted line) and tangential (7 = 5) DFs, we recover the anisotropy 
correctly, but not for the radial DF (7 = —5) and some simulations with an isotropic DF 
(7 = 0, dashed line). Our simulations show that there is a high probability (5 out of 20 runs) 
of failing to recover the anisotropy correctly. Therefore, we can not get a reliable result for 
anisotropy from a small (~160 velocities) data set with these parameters. 

Our simulations suggest that we can recover the dependence of the DF on energy E 
reasonably well with a sample of this size but we can not reliably recover the dependence of 
DF on angular momentum L. This is probably because the observable distribution g(R,v z ) 
depends more strongly on how GCs are distributed in energy than angular momentum. Of 
course, with more data, we may constrain the anisotropy of the DF more tightly. 



4. Mass distribution of M87 

4.1. A uniform dataset 

Surveys of GC positions and radial velocities for M87 are described in Section 1. The 
largest survey of GC radial velocities is given by Cote et al. (2001), who list 278 GCs. 
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However, they are drawn from a variety of sources. To construct a uniform sample, we use 
only the data in Cohen and Ryzhov (1997): these are drawn from a survey by Strom et 
al. (1981), which was complete only in the radius range 90"-405". So we have the survey 
limits (R s o, R s \) = (7, 32) kpc (see Figure 4). Restricting our sample to this range, we have 
161 GCs, which are shown in Figure 5. The errors in radial velocities given by Cohen and 
Ryzhov (1997) range between 50-100 km s -1 ; we set the errors equal to 75 km s _1 for all 
GCs. 

We also incorporate the surface number density profile H b s (Rj), which is given with 
estimated error a-s(Rj) in 25 bins from 7 to 110 kpc by Harris (1986) and McLaughlin 
(1999a), as an additional constraint in our calculation, as described in §2 . Both T,(R) and 
S fes {R) should be normalized so that 2n J RdRT,(R) and 2n f RdRT, obs (R) are unity within 
the radius range 7-110 kpc. Notice that we do double count some of the GCs in 7-32 kpc 
since they appear in both the kinematic data [Ri,v Z i] and in the surface density distribution 
Yl (Rj). However, this does not affect our results in a noticeable way because there are over 
7000 GCs used in the calculation of the surface density distribution. 

4.2. Analysis 

We have described our method in §2. In constructing our models we must choose a 
maximum energy En e (eq. 17) for the cluster population, or equivalently, a largest apocenter 
r m such that $(r m ) = En e . We choose r m = 300 kpc, 9 times larger than the outer survey 
radius R s ± = 32 kpc. If r m is too small, we bias our results by excluding high-energy GCs 
that may be present within the survey radii. If r m is too large, the numerical work will be 
dominated by GCs that spend only a small fraction of their orbits within the survey radii. 
For comparison, Harris (1986) observed GCs out to projected radius R = 110 kpc. 

4.3. Results 

4-3.1. Isotropic power-law models 

Initially we assume that the DF is isotropic (Ne = 80, Nl = 1) and the mass density 
profile of the galaxy is a power law (eq. 8). We set the arbitrary radius r in equation (8) 
to be 19 kpc, near the median radius of our samples, to minimize the covariance of p and 
a in their bivariate probability distribution. 

In this case, X = {po,a}. Figures 6a,b show the contour plot of Q' max (X.) and the 
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resulting best fit to the surface number density profile. The appropriate smoothing param- 
eters are (A^, A^) = (0.015,0.015) so that n s = 2 as described in §2. We get a very tight 
constraint on the potential if the DF is assumed isotropic. The isotropic models give M(32 
kpc) = (2.6 ±0.3) x 1O 12 M and a = 1.8 ±0.2. The best fit value of a is consistent with the 
value 1.7 found by Romanowsky & Kochanek (2001) for similar isotropic models. Figure 7 
shows the derived energy distribution of the GCs. The dotted line (bottom) is the energy 
distribution of GCs within projected radius R = 7 — 32 kpc, which is calculated from the 
DF associated with the best-fit model in Figure 6a. The upper dashed line is the energy dis- 
tribution of all GCs with R>7 kpc in the same model. The error bars are estimated from 
bootstrap resampling. The energy distributions suggest that there may be kinematically 
distinct groups of GCs in M87. 



4-3.2. Anisotropic power-law models 

Figures 8a and 8b show the result of fitting models with power-law potentials and 
anisotropic DFs (N E = 40, N L = 5) to both the kinematic data and the observed surface 
number density, using both A#, \l = and the appropriate smoothing parameters (A^, \l) = 
(0.0025, 0.60) so that n s = 2, respectively. The contours in Figure 8a (without regularization) 
and Figure 8b (with appropriate smoothing parameters) are very similar, suggesting that 
the estimate of potential parameters of M87 is rather insensitive to smoothing parameters. 
The contour in Figure 8b is more irregular and the uncertainties are larger than in Figure 
6a, which is expected since anisotropy gives more freedom to adjust the DF to fit the data; 
however, the best-fit potential parameters (po, «) = (1-9 x 10 7 M G kpc~ 3 , 1.9) (plus symbol) 
do not change from those for isotropic models. A secondary peak, marked by the asterisk 
symbol, has Q' smaller by 0.2. The derived surface number density from the best-fit model in 
Figure 8c has a stronger tail than that derived under the assumption of isotropy (see Figure 
6b). Figure 8d shows the energy distributions of GCs within projected radius R = 7 — 32 
kpc (solid line) and with R > 7 kpc (dotted line), which are also concentrated into a few 
peaks. The peak at E ~ 440 x (100 km s -1 ) 2 accounts for the strong tail in the derived 
surface number density profile beyond 100 kpc. 

These results are based on the estimate of background stars 5.8 ± 0.3 stars arcmin -2 
(Harris 1986). We also investigate the case of 6.1 stars arcmin -2 . We get almost the same 
likelihood contours and the best-fit potential parameters change by less than 1 — a; however, 
the amplitude of the high-energy tail is reduced by a factor of 2. 

Figure 9 shows the indicators of anisotropy I(E,L) and J(L/L C ). The results suggest 
that the GCs prefer circular orbits (high angular momentum) for the best-fit model but 
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radial orbits (low angular momentum) at the secondary peak. To reliably determine the 
anisotropy of the GCs, we need to obtain more data to give a tighter constraint on mass 
distribution and develop a more robust formalism of choosing smoothing parameters in the 
future. 

To estimate the probability distribution of parameters derived from the potential, we 
lay down points in Figure 8b with a uniform probability distribution in (In p Q , a), then reject 
or save each point according to Q'(X,W) at that point. From the saved points, we may 
estimate the distribution of M(r), the mass enclosed within radius r, and the exponent 
a. Figure 10a shows the probability distribution of M(r = 32 kpc), which is derived from 
4 x 10 4 points and has been normalized to its maximum. Figure 10b shows the probability 
distribution of the density slope a. The data provide a fairly tight constraint on a, which is 
1.6 ±0.4 (l-cr error). 

If we stack curves like that in Figure 10a for different radii, we get the mass distribution 
for all radii, which is shown in Figure 11a. The solid curves mark the mean and 1,2,3-cr 
errors. The solid line and error bars in Figure lib show the best estimate of M(r) and its 
standard deviation, which gives M(32 kpc) = (2.4 ± 0.6) x 10 12 M , differing from that in 
isotropic models by less than the error bars. 

Notice that the relative error is the smallest around 32 kpc, which is reasonable since 
it is the maximum survey radius. The mass profile can be approximately described as an 
analytical form, 

M(r) = (2.3r 1 - 36 )!g : ^;^:;; x 10 10 M , 15 kpc < r < 110 kpc, (42) 
where r is in units of kpc. 

Also shown in Figure lib are previously published models. The data for X-ray ob- 
servations (Nulsen & Bohringer 1995) were obtained by assuming a distance of 20 Mpc to 
M87. The long dashed lines show their derived lower and higher limits scaled by a factor 
of 0.815 since we assume a distance of 16.3 Mpc. Our mass estimate is consistent with the 
X-ray observations in the range where they overlap. Our result is also consistent with X-ray 
observation by Matsushita et al. (2002) within 20% from 15 to 80 kpc by reading numbers 
from the double (3 model fit in their Figure 21. The estimate by McLaughlin (1999b) is 
based on X-ray observations and an assumption of a NFW dark matter halo of the Virgo 
cluster. The estimates by Merritt (1993) and Cohen and Ryzhov (1997) are based on the 
assumption of an isotropic DF. 

Figure 12 shows the derived velocity dispersion profile (dotted line) 
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which is consistent with the observed velocity dispersion profile (solid line with error bars). 

4.3.3. Other models 

We also investigate models with two other potentials, (i) a NFW profile (eq. 9), (ii) the 
potential generated by the stars, assuming constant mass-to-light ratio. We adopt the IB- 
band luminosity profile in McLaughlin (1999b), which is smaller than the luminosity profile 
in Romanowsky & Kochanek (2001) by 23%, probably due to a systematic shift between 
different data they use. 

The dotted line in Figure lib shows the estimate for M(r) for anisotropic NFW models, 
which is almost the same as that for power-law models beyond 20 kpc. The errors are 
comparable to those for power-law models. We explore NFW models with the concentration 
radius r c ranging from 10 to 500 kpc. If the DF is assumed isotropic, we get r c = 26^4 7 kpc, 
while models with r c = 100 and 500 kpc are ruled out at 2 and 3 — a level, respectively. 
However, the best-fit anisotropic NFW models with 10 kpc < r c < 500 kpc give Q' max within 
2 — cr, i.e., a wide range of r c can fit the data pretty well. The models NFW1, NFW2 and 
NFW3 of Romanowsky & Kochanek (2001) are within l-a relative to our own best fit. 

The constant mass-to-light ratio model gives (M/L)b = 125±1OM Q /L , which is much 
larger than the value 43±3M /L obtained by Romanowsky & Kochanek (2001) fitting the 
globular cluster kinematics. The difference in the B-band luminosity profiles can not account 
for the large gap. For the same regularization level, the maximum penalized likelihood for the 
best-fit powerlaw and NFW models differ by less than l-a while the maximum for constant 
M/L is smaller by 7, i.e., the constant M/L model is about A-a worse than for the NFW or 
power-law models. 

5. Discussion 

The mass distribution of M87 from GCs has been studied by Merritt & Tremblay (1993); 
Cohen and Ryzhov (1997); Romanowsky & Kochanek (2001). In general our results are 
consistent with the conclusions of these earlier papers. Our results are more robust than 
those by Merritt & Tremblay (1993) and Cohen and Ryzhov (1997) since we assume a two- 
integral DF rather than an isotropic DF. In some aspects, our analysis is less powerful than 
that of Romanowsky & Kochanek (2001), because we do not use the stellar part to constrain 
the mass distribution of M87. Therefore, it is not surprising that we constrain the mass 
distribution around the effective radius (R e = 7 kpc) rather poorly (see Figure 11). As to 
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the analysis of GC kinematics, our methods and results differ from those in Romanowsky & 
Kochanek (2001) in the following aspects. 

1. Romanowsky & Kochanek (2001) fit their kinematic data to predictions for the proba- 
bility distribution of radial velocities at given radii, rather than to the joint probability 
distribution of radius and velocity as we do. This difference in approaches is nec- 
essary because they use an incomplete sample of GCs, whereas ours is complete for 
R G (R s o, Rsi) = (7, 32) kpc. Our approach is statistically more powerful but requires 
us to use a subset of the GCs with measured radial velocities. 

2. We use regularization rather than entropy to smooth our results. As we have argued 
in §2, entropy smoothing tends to favor high-energy orbits. Romanowsky & Kochanek 
(2001) use entropy only as a numerical device to accelerate convergence to a final 
solution, eventually reducing the entropy contribution to a negligible value by reducing 
the constant a in equation (20) to a very small value. Nevertheless, this process may 
still tend to bias the solution towards high-energy orbits. 

3. Although Romanowsky & Kochanek (2001) explore both the best fit and the uncer- 
tainties in their mass models, they do not rigorously explore the range of DFs and 
velocity anisotropies that are consistent with the data. We find that either radial or 
tangential anisotropy in the DF are allowed by the data. We also explore a wider range 
of mass models. 

The observations of metallicity of GCs show that there are metal-rich and metal-poor 
globular cluster samples in M87. The metal-rich sample is more concentrated to the center 
(Cote et al. 2001). We tried to divide all GCs into two samples to study their kinematics 
individually. Ideally it would be interesting to see how the subsamples contribute to the 
peaks observed in the energy distribution in Figure 7 and 8d. However, the present data set 
is too small so that the statistics is too poor to reach a conclusion. 

An important question is whether the mass distribution we have measured is a dark 
halo associated with M87 or with the Virgo cluster as whole. If the DF is isotropic, we find 
that r c = 26^4 ? kpc, which implies that the halo is small enough that it must be associated 
with M87. However, for general DFs all halos with 10 < r c < 500 kpc fit the data, so we 
cannot determine whether the halo is associated with the galaxy or its host cluster. 
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6. Conclusions 

We have investigated how to determine the mass distribution of a spherical stellar system 
from the kinematics of a discrete set of test particles. We use a non-parametric form for 
the DF, which includes the isotropic DF as a special case, and parametrized forms for the 
potential, and find the best-fit potential and DF using a penalized maximum likelihood 
method. Our method allows for selection effects, observational and statistical errors, and 
anisotropy in the DF. We find that the potential parameters are determined more securely 
than the energy distribution, which is in turn more accurate than the angular-momentum 
distribution. Our simulations show that mass distributions depending on two parameters 
can be derived rather accurately from a small number (< 200) of data pairs [Ri,v Z i] and 
surface number density profile. Plausible estimates of the energy distribution of the test 
particles can also be recovered; however, we can not tell whether the DFs are isotropic or 
not from such a small dataset. 

We have applied our method to a sample of 161 GCs between 7 kpc and 32 kpc in M87. 
Under the assumption that the system is spherical and the DF is isotropic, we infer that 
the mass of M87 within 32 kpc is 2.6 ± 0.3 x 10 12 M Q . The power-law index for the density 
profile, p oc r~ a is a = 1.8 ± 0.2. The energy distribution suggests a few kinematically 
distinct groups of globular clusters. If we allow an anisotropic DF, the mass of M87 within 
32 kpc is (2.4 ± 0.6) x 1O 12 M and the power-law index is 1.6 ± 0.4, within 1 standard 
deviation of the results for an isotropic DF. Anisotropic models fitted to an NFW potential 
yield similar mass estimates but cannot constrain the concentration radius r c in the range 
10 — 500 kpc although r c is as small as 26l 27 kpc in isotropic NFW models. Assuming a 
constant mass-to-light ratio, the derived M/L in the B-band is 125 ± 1OM Q /L , but the 
model fit is substantially worse than for other mass models, and the derived mass-to-light 
ratio is far larger than any plausible stellar population. Our methods and results are more 
rigorous than earlier attempts to measure the mass of M87 from GC kinematics (Merritt & 
Tremblay 1993; Cohen and Ryzhov 1997) and differ from those in Romanowsky & Kochanek 
(2001) as discussed in §5. 

We find that our mass estimates are insensitive to the smoothing parameters. Smoothing 
has stronger effects on the DF, which is determined less reliably. The energy distribution has 
multiple peaks for the chosen smoothing parameters, perhaps suggesting a few kinematically 
distinct groups of GCs in M87. The derived anisotropy of the GCs can not be determined 
at this stage. With more data, it may be possible to measure the anisotropy of the velocity 
distribution and to test models of GC evolution predicting that GCs on less-eccentric orbits 
are more likely to survive (e.g. Gnedin & Ostriker 1997). 

A more statistically robust method of choosing smoothing parameters, more kinematic 
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data and a more accurate surface number density, especially beyond 110 kpc are the most 
important theoretical and observational advances needed to determine the mass distribution 
in M87 more accurately by this method. For example, the ongoing ACS Virgo cluster survey 
(Cote et al. 2004) will observe 20,000 globular clusters in the Virgo cluster, which may allow 
us to put tighter constraints on the mass density profiles of M87 and other member galaxies. 

We thank Judy Cohen for providing the data in Figures 4 and 5 in tabular form, Michael 
Strauss, David Spergel, James Gunn and Robert Lupton for valuable discussions, and the 
referee for suggestions that significantly improved the paper. This research was supported 
in part by NASA grant NNG04GL47G and used computational facilities supported by NSF 
grant AST-0216105. 



A. Calculation of g mn (R,v z ) 

We divide E — L space into N E x rectangular bins. Assume that the bin mn is 
defined by 

E £ [E m -i,E m ], 

L e [L„_i,L„]. (Al) 

We wish to determine the normalized distribution g m n{R,v z ) in projected radius and ra- 
dial velocity corresponding to a uniform phase-space density in bin mn (eq. 18). For an 
anisotropic DF, this requires a three-dimensional integration over v x , v y and z, which is 
computationally expensive. We may achieve considerate speedup by integrating analytically 
over v y (recall from eq. 4 that we assume that each GC lies on the x-axis). We have two 
constraints on v y from inequalities (Al). Based on equation (4), we have 

v 2 y G [A, B] 

ee [2(E m ^ - $(r)) - {vl + v 2 z ), 2(E m - $(r)) - (v 2 x + v 2 z )} , (A2) 
v 2 y e [C,D] 

' L 2 n -i ~ (v z R-v x z) 2 L\ - (y z R-y x z) 2 } 
~ (R 2 + z 2 ) (R 2 + z 2 ) I ' 1 ' 

Then we may set 



/ ee max(min(i?, D),0) J = max(max(A, C), 0), 



(A4) 
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so that we have 

/,^ m a x ((V7-V7),0). (A5) 
This algorithm speeds up the integration by a factor of 20 or so. 
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Fig. 1. — Tests of the recovery of the potential parameters X = {p , a} and the DFs, using 
160 data points with radial velocities and 7000 data points to define the surface density 
profile in a power-law potential (eq. 8). The input parameters are marked by the asterisk 
signs. The input DFs are described by equation (37) with anisotropy parameter 7 = — 5 
(radial), (isotropic) and 5 (tangential). Equation (24) is used for regularization. The plus 
signs mark the best-fit model, which gives the maximum Q' max (X.) (eq. 34). The contours 
are n — a away from the peak. The tests show that we recover the potential parameters 
pretty well even for extremely anisotropic DF. 
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Fig. 2. — Derived energy distributions associated with the best-fit models in Figure 1. The 
derived energy distributions (dotted lines) successfully recover the bimodality of the input 
energy distributions (solid lines) for all cases. The shift between the input and derived 
energy distributions for the radial DF (7 = —5) is related to the relatively large errors in 
derived potential parameters, which shift the zero-energy point. The results show that we 
can recover the energy distribution from a small (~160 velocities) data set reasonably well. 
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Fig. 3. — Indicators of anisotropy J(L/L C ) (eq. 41). The solid lines show the input and the 
dotted lines show the derived anisotropy indicators associated with the best-fit models in 
Figure 1. For the tangential (7 = 5) DF, we recover the anisotropy correctly, but not for the 
radial DF (7 = —5). Some simulations with an isotropic DF (7 = 0, dashed lines) also fail 
to recover the correct anisotropy. Therefore, we can not get a reliable result for anisotropy 
from a data set of this kind. 
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Fig. 4. — Spatial distribution of GC candidates in Cohen and Ryzhov (1997), which are 
taken from the photometric survey by Strom et al. (1981). To eliminate selection effects, we 
only use data between the two circles, which are 90"-405"or 7-32 kpc from the center, where 
the Strom et al. survey was complete. 
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Fig. 5. — The data pairs [Ri, v z i\ from Cohen and Ryzhov (1997) that we use to construct a 
uniform dataset. The data outside the survey radii (dotted lines) are excluded. 
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Fig. 6. — Fitting the kinematic data [Ri, v zi ] and the observed surface number density 
E ft s (i?) of GCs in M87 to a power-law potential, assuming an isotropic DF. (a) Contour plot 
of Q' max (X.) as a function of X = {p ,a}. (b) The derived surface number density (dotted 
line) from maximizing Q'(X., W), compared to the observed S (i?) (solid line). The surface 
densities are normalized so that the total number within the range 7-110 kpc is unity. The 
kinematic survey limits are marked by dashed lines. 
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Fig. 7. — Energy distribution with the isotropic assumption: the energy weights W as 
determined by fitting both the kinematic data and the surface density profile, as in Figure 
6. The bottom dotted line refers to the energy distribution of GCs within projected radius 
R = 7 — 32 kpc and the upper dashed line to all GCs with R > 7 kpc. The error bars are 
estimated from bootstrap resampling. 
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Fig. 8. — Fitting the kinematic data [R iy v zi ] and the observed surface number density S obs (i?) 
to a power-law potential without assuming isotropy of the DF. (a) The same as Figure 6a 
except for anisotropic models without regularization. (b) The same as Figure 8a except for 
anisotropic models with appropriate smoothing parameters. The plus and asterisk mark the 
best-fit model and a secondary peak, respectively, (c) The same as Figure 6b except for the 
best-fit anisotropic model with appropriate smoothing parameters, (d) Energy distribution 
of GCs within projected radius R = 7 — 32 kpc (solid line) and with R>7 kpc (dotted line). 
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Fig. 9.— Indicators of anisotropy for X E , \ L = 0.005. (a) I(E, L) (eq. 39). (b) J(L/L C ) 
(eq. 41) for the best-fit model (solid line) and at the secondary peak (dotted line) in Figure 
8b. The behavior of the indicators suggests that the GCs may either prefer circular orbits 
(high angular momentum) or radial orbits (low angular momentum), depending on the mass 
distribution model of the galaxy. 
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Fig. 10. — Distribution of M(r = 32 kpc) and a for likelihood distribution of Figure 8b. (a) 
The probability distribution of M(r =32 kpc) for the power-law (solid line) and NFW models 
(dotted line), (b) Probability function of a for the power-law models. It gives a — 1.6 ± 0.4. 
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Fig. 11. — Estimates of M(r), derived from Monte Carlo points laid down on Figure 8b. (a) 
The probability distribution of M(r). The lines show the median and 1,2,3-cx errors which 
enclose the regions with the cumulative probabilities of 68.3%, 95.4%, 99.7% for the power- 
law models (solid lines) and the NFW models (dotted lines), (b) The solid and dotted lines 
show the estimates for M(r) by assuming power-law and NFW density profiles, respectively. 
The error bars indicate the uncertainties in the mass estimates, which are similar for both 
models. The other models are: Merritt & Tremblay (1993) (marked by "M"), Cohen and 
Ryzhov (1997) (marked by "C"), upper and lower limits from X-ray observation by Nulsen 
& Bohringer (1995) (long dashed lines), McLaughlin (1999b) (dot-dashed line). The best-fit 
model NFW2 (star+halo) in Romanowsky & Kochanek (2001) is within 5% of our best-fit 
power-law model from 7 kpc to 110 kpc, thus is not shown for clarity. However, considering 
the large error bars (> 30%) in our models, it is probably a coincidence that the two results 
are so close. Our result is also consistent with X-ray observation by Matsushita et al. (2002) 
within 20% from 15 to 80 kpc by reading numbers from the double (3 model fit in their Figure 
21. 
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Fig. 12. — Velocity dispersion, including assumed observational errors of 75 kms -1 in the 
theoretical model. The dotted line is derived from the best-fit anisotropic power-law model, 
which is consistent with the observations (solid line with error bars). 



